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ABSTRACT 

Over half a century old and showing no signs of aging, 
fc-means remains one of the most popular data process- 
ing algorithms. As is well-known, a proper initialization 
of fc-means is crucial for obtaining a good final solution. 
The recently proposed fc-means++ initialization algorithm 
achieves this, obtaining an initial set of centers that is prov- 
ably close to the optimum solution. A major downside of the 
fc-meatns++ is its inherent sequential nature, which limits its 
applicability to massive data: one must make k passes over 
the data to find a good initial set of centers. In this work we 
show how to drastically reduce the number of passes needed 
to obtain, in parallel, a good initialization. This is unlike 
prevailing efforts on parallelizing fc-means that have mostly 
focused on the post-initialization phases of fc-means. We 
prove that our proposed initialization algorithm fc-meansH 
obtains a nearly optimal solution after a logarithmic num- 
ber of passes, and then show that in practice a constant 
number of passes suffices. Experimental evaluation on real- 
world large-scale data demonstrates that fc-meansll outper- 
forms fc-meEms++ in both sequential and parallel settings. 

1. INTRODUCTION 

Clustering is a central problem in data management and 
has a rich and illustrious history with literally hundreds of 
different algorithms published on the subject. Even so, a 
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single method — fc-means — remains the most popular clus- 
tering method; in fact, it was identified as one of the top f 
algorithms in data mining [34]. The advantage of fc-means 
is its simplicity: starting with a set of randomly chosen ini- 
tial centers, one repeatedly assigns each input point to its 
nearest center, and then recomputes the centers given the 
point assignment. This local search, called Lloyd's itera- 
tion, continues until the solution does not change between 
two consecutive rounds. 

The fc-means algorithm has maintained its popularity even 
as datasets have grown in size. Scaling fe-means to massive 
data is relatively easy due to its simple iterative nature. 
Given a set of cluster centers, each point can independently 
decide which center is closest to it and, given an assignment 
of points to clusters, computing the optimum center can be 
done by simply averaging the points. Indeed parallel imple- 
mentations of fc-means arc readily available (see, for exam- 
ple, cwiki . apache . org/MAHDUT/k-means- clustering . html) 

From a theoretical standpoint, fc-means is not a good clus- 
tering algorithm in terms of efficiency or quality: the run- 
ning time can be exponential in the worst case [32, 4] and 
even though the final solution is locally optimal, it can be 
very far away from the global optimum (even under repeated 
random initializations). Nevertheless, in practice the speed 
and simplicity of fe-means cannot be beat. Therefore, recent 
work has focused on improving the initialization procedure: 
deciding on a better way to initialize the clustering dramati- 
cally changes the performance of the Lloyd's iteration, both 
in terms of quality and convergence properties. 

A important step in this direction was taken by Ostro- 
vsky et al. [30] and Arthur and Vassilvitskii [5] , who showed 
a simple procedure that both leads to good theoretical guar- 
antees for the quality of the solution, and, by virtue of a good 
starting point, improves upon the running time of Lloyd's 
iteration in practice. Dubbed A:-means++, the algorithm se- 
lects only the first center uniformly at random from the data. 
Each subsequent center is selected with a probability pro- 
portional to its contribution to the overall error given the 
previous selections (we make this statement precise in Sec- 
tion 3). Intuitively, the initialization algorithm exploits the 
fact that a good clustering is relatively spread out, thus 
when selecting a new cluster center, preference should be 
given to those further away from the previously selected cen- 
ters. Formally, one can show that fc-means++ initialization 



622 



leads to an 0(logA:) approximation of the optimum [5], or 
a constant approximation if the data is known to be well- 
clusterable [30]. The experimental evaluation of fc-means++ 
initialization and the variants that followed [1, 2, 15] demon- 
strated that correctly initializing Lloyd's iteration is crucial 
if one were to obtain a good solution not only in theory, but 
also in practice. On a variety of datasets fc-means++ initial- 
ization obtained order of magnitude improvements over the 
random initialization. 

The downside of the fc-means++ initialization is its inher- 
ently sequential nature. Although its total running time of 
0{nkd), when looking for a fc-clustering of n points in R'', 
is the same as that of a single Lloyd's iteration, it is not ap- 
parently parallelizable. The probability with which a point 
is chosen to be the ith center depends critically on the real- 
ization of the previous i — 1 centers (it is the previous choices 
that determine which points are away in the current solu- 
tion). A naive implementation of fc-means++ initialization 
will make k passes over the data in order to produce the 
initial centers. 

This fact is exacerbated in the massive data scenario. 
First, as datasets grow, so does the number of classes into 
which one wishes to partition the data. For example, clus- 
tering millions of points into k — 100 or fe = 1000 is typical, 
but a A:-means++ initialization would be very slow in these 
cases. This slowdown is even more detrimental when the 
rest of the algorithm (i.e., Lloyd's iterations) can be imple- 
mented in a parallel environment like MapReduce [13]. For 
many applications it is desirable to have an initialization al- 
gorithm with similar guarantees to fc-meaiis++ and that can 
be efficiently parallelized. 

1.1 Our contributions 

In this work we obtain a parallel version of the fc-means++ 
initialization algorithm and empirically demonstrate its prac- 
tical effectiveness. The main idea is that instead of sampling 
a single point in each pass of the fc-means++ algorithm, we 
sample 0{k) points in each round and repeat the process for 
approximately O(logn) rounds. At the end of the algorithm 
we are left with 0{k logn) points that form a solution that is 
within a constant factor away from the optimum. We then 
recluster these 0{k\ogn) points into k initial centers for the 
Lloyd's iteration. This initialization algorithm, which we 
call fc-meansll , is quite simple and lends itself to easy paral- 
lel implementations. However, the analysis of the algorithm 
turns out to be highly non-trivial, requiring new insights, 
and is quite different from the analysis of fe-means++. 

We then evaluate the performance of this algorithm on 
real-world datasets. Our key observations in the experi- 
ments are: 

• 0(log n) iterations is not necessary and after as little as 
five rounds, the solution of fc-meansll is consistently as 
good or better than that found by any other method. 

• The parallel implementation of fc-meEoislJ is much faster 
than existing parallel algorithms for fc-means. 

• The number of iterations until Lloyd's algorithm con- 
verges is smallest when using fc-meansll as the seed. 



2. RELATED WORK 

Clustering problems have been frequent and important 
objects of study for the past many years by data manage- 
ment and data mining researchers.^ A thorough review of 
the clustering literature, even restricted to the work in the 
database area, is far beyond the scope of this paper; the 
readers are referred to the plethora of surveys available [8, 
10, 25, 21, 19]. Below, we only discuss the highlights directly 
relevant to our work. 

Recall that we are concerned with fe-partition clustering: 
given a set of n points in Euclidean space and an integer 
fc, find a partition of these points into fc subsets, each with 
a representative, also known as a center. There are three 
common formulations of fc-partition clustering depending on 
the particular objective used: fe-center, where the objective 
is to minimize the maximum distance between a point and 
its nearest cluster center, fc-median, where the objective is 
to minimize the sum of these distances, and fc-means, where 
the objective is to minimize the sum of squares of these 
distances. All three of these problems are NP-hard, but a 
constant factor approximation is known for them. 

The fc-means algorithms have been extensively studied 
from database and data management points of view; we dis- 
cuss some of them. Ordonez and Omiecinski [29] studied 
efficient disk-based implementation of fc-means, taking into 
account the requirements of a relational DBMS. Ordonez 
[28] studied SQL implementations of the fc-means to better 
integrate it with a relational DBMS. The scalability issues 
in fc-means are addressed by Farnstrom et al. [16], who 
used compression-based techniques of Bradley et al. [9] to 
obtain a single-pass algorithm. Their emphasis is to initial- 
ize fc-means in the usual manner, but instead improve the 
performance of the Lloyd's iteration. 

The fc-means algorithm has also been considered in a par- 
allel and other settings; the literature is extensive on this 
topic. Dhillon and Modha [14] considered fc-means in the 
message-passing model, focusing on the speed up and scal- 
ability issues in this model. Several papers have studied 
fc-means with outliers; see, for example, [22] and the refer- 
ences in [18]. Das et al. [12] showed how to implement EM 
(a generalization of fc-means) in MapReduce; see also [36] 
who used similar tricks to speed up fc-means. Sculley [31] 
presented modifications to fc-means for batch optimizations 
and to take data sparsity into account. None of these papers 
focuses on doing a non-trivial initialization. More recently, 
Ene et al. [15] considered the fc-median problem in MapRe- 
duce and gave a constant-round algorithm that achieves a 
constant approximation. 

The fc-means algorithms have also been studied from the- 
oretical and algorithmic points of view. Kanungo et al. [23] 
proposed a local search algorithm for fc-means with a run- 
ning time of 0{n^e~'') and an approximation factor of 9-1- e. 
Although the running time is only cubic in the worst case, 
even in practice the algorithm exhibits slow convergence to 
the optimal solution. Kumar, Sabharwal, and Sen [26] ob- 
tained a (1 + e)-approximation algorithm with a running 
time linear in n and d but exponential in fc and -. Os- 
trovsky et al. [30] presented a simple algorithm for finding 
an initial set of clusters for Lloyd's iteration and showed 
that under some data separability assumptions, the algo- 



^ A paper on database clustering [35] won the 2006 SIGMOD 
Test of Time Award. 
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rithms achieve an 0(l)-approxiniation to the optimum. A 
similar method, fc-means++, was independently developed 
by Arthur and Vassilvitskii [5] who showed that it achieves 
an 0(logfe)-approximation but without any assumptions on 
the data. 

Since then, the fc-means++ algorithm has been extended to 
work better on large datasets in the streaming setting. Ailon 
et al. [2] introduced a streaming algorithm inspired by the 
fe-means++ algorithm. Their algorithm makes a single pass 
over the data while selecting O(felogfc) points and achieves 
a constant-factor approximation in expectation. Their al- 
gorithm builds on an influential paper of Guha et al. [17] 
who gave a streaming algorithm for the fc-median problem 
that is easily adaptable to the fc-means setting. Ackermann 
et al. [1] introduced another streaming algorithm based on 
fc-means++ and show that it performs well while making a 
single pass over the input. 

We will use MapReduce to demonstrate the effectiveness 
of our parallel algorithm, but note that the algorithm can be 
implemented in a variety of parallel computational models. 
We require only primitive operations that are readily avail- 
able in any parallel setting. Since the pioneering work by 
Dean and Ghemawat [13], MapReduce and its open source 
version, Hadoop [33], has become a de facto standard for 
large data analysis, and a variety of algorithms have been 
designed for it [7, 6] . To aid in the formal analysis of MapRe- 
duce algorithms, Karloff et al. [24] introduced a model of 
computation for MapReduce, which has since been used to 
reason about algorithms for set cover [11], graph problems 
[27], and other clustering formulations [15]. 

3. THE ALGORITHM 

In this section we present our parallel algorithm for ini- 
tializing Lloyd's iteration. First, we set up some notation 
that will be used throughout the paper. Next, we present 
the background on fc-means clustering and the fc-means++ 
initialization algorithm (Section 3.1). Then, we present our 
parallel initialization algorithm, which we call fc-meansll (Sec- 
tion 3.3). We present an intuition why fc-meansll initializa- 
tion can provide approximation guarantees (Section 3.4); the 
formal analysis is deferred to Section 6. Finally, we discuss 
a MapReduce realization of our algorithm (Section 3.5). 

3.1 Notation and background 

Let X — {xi, . . . , a;„} be a set of points in the d-dimensional 
Euclidean space and let fe be a positive integer specifying 
the number of clusters. Let [[3::^ — a;j|j denote the Euclidean 
distance between Xi and Xj. For a point x and a subset 
y C X of points, the distance is defined as d{x, Y) — 
miuygy \\x — y\\. For a subset Y (Z X oi points, let its 
centroid be given by 

1 



centroid(y) = pr^T ^^ J/ 
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Let C = {ci, . . . , Cfc} be a set of points and let Y (Z X. 
We define the cost of Y with respect to C as 



0y(C) = ^d^(y,C) = ^ 
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we simply denote this as <j). Let (f)* be the cost of the opti- 
mal fc-means clustering; finding (j>* is NP-hard [3]. We call 
a set C of centers to be an a- approximation to fc-means if 
4'x{C) < a(j>* . Note that the centers automatically define a 
clustering of X as follows: the ith cluster is the set of all 
points in X that are closer to c; than any other Cj,j^i. 

We now describe the popular method for finding a locally 
optimum solution to the fc-mesins problem. It starts with a 
random set of fc centers. In each iteration, a clustering of X 
is derived from the current set of centers. The centroids of 
these derived clusters then become the centers for the next 
iteration. The iteration is then repeated until a stable set 
of centers is obtained. The iterative portion of the above 
method is called Lloyd's iteration. 

Arthur and Vassilvitskii [5] modified the initialization step 
in a careful manner and obtained a randomized initializa- 
tion algorithm called fc-means++. The main idea in their 
algorithm is to choose the centers one by one in a controlled 
fashion, where the current set of chosen centers will stochas- 
tically bias the choice of the next center (Algorithm 1). The 
advantage of fc-means++ is that even the initialization step 
itself obtains an (8 log fc)-approximation to (f>* in expectation 
(running Lloyd's iteration on top of this will only improve 
the solution, but no guarantees can be made). The cen- 
tral drawback of fc-means++ initialization from a scalability 
point of view is its inherent sequential nature: the choice of 
the next center depends on the current set of centers. 

Algorithm 1 fe-means++(fe) initialization. 



C <— sample a point uniformly at random from X 
while \C\ < fc do 

Sample x £ X with probability — \!'^i-J- 



C^C\J{x} 
end while 



<*x(C) 



3.2 Intuition behind our algorithm 

We describe the high-level intuition behind our algorithm. 
It is easiest to think of random initialization and fc-means++ 
initialization as occurring at two ends of a spectrum. The 
former selects fc centers in a single iteration according to a 
specific distribution, which is the uniform distribution. The 
latter has fc iterations and selects one point in each iteration 
according to a non-uniform distribution (that is constantly 
updated after each new center is selected). The provable 
gains of fc-means++ over random initialization is precisely 
in the constantly updated non-uniform selection. Ideally, 
we would like to achieve the best of both worlds: an algo- 
rithm that works in a small number of iterations, selects 
more than one point in each iteration but in a non-uniform 
manner, and has provable approximation guarantees. Our 
algorithm follows this intuition and finds the sweet spot (or 
the best trade-off point) on the spectrum by carefully defin- 
ing the number of iterations and the non-uniform distribu- 
tion itself. While the above idea seems conceptually simple, 
making it work with provable guarantees (as fc-means++) 
throws up a lot of challenges, some of which are also clearly 
reflected in the analysis of our algorithm. We now describe 
our algorithm. 



The goal of fc-means clustering is to choose a set C of fc cen- 
ters to minimize (f>x(C); when it is obvious from the context. 
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3.3 Our initialization algorithm: A;-means|| 

In this section we present fc-meansll , our parallel version 
for initializing the centers. While our algorithm is largely 
inspired by A:-means++, it uses an oversampling factor £ — 
r2(fc), which is unlike fc-means++; intuitively, £ should be 
thought of as &{k). Our algorithm picks an initial center 
(say, uniformly at random) and computes ^, the initial cost 
of the clustering after this selection. It then proceeds in log V> 
iterations, where in each iteration, given the current set C of 
centers, it samples each x with probability £d^{x,C)/(f>x{C). 
The sampled points are then added to C, the quantity (j)x (C) 
updated, and the iteration continued. As we will see later. 

Algorithm 2 fc-meansll (fc, £) initialization. 
1: C ^— sample a point uniformly at random from X 
2: V^0x(C) 
3: for 0(logi/>) times do 

4: C' <— sample each point x £ X independently with 
probability p, = "fj^^^^ 

5: C^CUC 
6: end for 

7: For X £ C, set Wx to be the number of points in X closer 

to X than any other point in C 
8: Recluster the weighted points in C into fe clusters 

the expected number of points chosen in each iteration is £ 
and at the end, the expected number of points in C is i? log tp, 
which is typically more than fc. To reduce the number of 
centers. Step 7 assigns weights to the points in C and Step 
8 reclusters these weighted points to obtain k centers. The 
details are presented in Algorithm 2. 

Notice that the size of C is significantly smaller than the 
input size; the reclustering can therefore be done quickly. 
For instance, in MapReduce, since the number of centers 
is small they can all be assigned to a single machine and 
any provable approximation algorithm (such as fc-means++) 
can be used to cluster the points to obtain k centers. A 
MapReduce implementation of Algorithm 2 is discussed in 
Section 3.5. 

While our algorithm is very simple and lends itself to a 
natural parallel implementation (in \ogip rounds'^), the chal- 
lenging part is to show that it has provable guarantees. Note 
that %p < n^A^, where A is the maximum distance among a 
pair of points in X . 

We now state our formal guarantee about this algorithm. 

Theorem 1. If an a- approximation algorithm is used in 
Step 8, then Algorithm k-meansll obtains a solution that is 
an O (a) -approximation to k-means. 

Thus, if fc-means++ initialization is used in Step 8, then 
fc-meansll is an 0(logfc)-approximation. In Section 3.4 we 
give an intuitive explanation why the algorithm works; we 
defer the full proof to Section 6. 

3.4 A glimpse of the analysis 

In this section, we present the intuition behind the proof 
of Theorem 1. Consider a cluster A present in the optimum 
fc-means solution, denote \A\ = T, and sort the points in A 
in an increasing order of their distance to centroid(^): let 

^In practice, our experimental results in Section 5 show that 
only a few rounds are enough to reach a good solution. 



the ordering be oi, . . . , ot. Let qt be the probability that 
at is the first point in the ordering chosen by fc-meansll and 
qT+i be the probability that no point is sampled from cluster 
A. Letting pt denote the probability of selecting at, we have, 
by definition of the algorithm, pt — £d'^{at,C)/(f>x{C). Also, 
since fc-meansll picks each point independently, for any 1 < 
t < T, we have qt = pt Yl^lJiii-Pj), and qr+i = ^-Y.t^i It- 
If at is the first point in A (w.r.t. the ordering) sampled 
as a new center, we can either assign all the points in A to 
at, or just stick with the current clustering of A. Hence, 
letting 
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we have 



E[<j,a{C U C')] < J2 Itst + qr+KpAiC), 
t=i 

Now, we do a mean-field analysis, in which we assume all 
Pt's {1 < t < T) to he equal to some value p. Geometrically 
speaking, this corresponds to the case where all the points 
in A are very far from the current clustering (and are also 
rather tightly clustered, so that all rf(at,C)'s {1 < t < T) 
are equal). In this case, we have qt = p(l — pY^^ , and 
hence {qt}i<t<T is a monotone decreasing sequence. By the 
ordering on ot's, letting 



E 



at 



we have that {st}i 



is an increasing sequence. Therefore 



T T / T T 



where the last inequality, an instance of Chebyshev's sum in- 
equality [20] , is using the inverse monotonicity of sequences 
{qt}i<t<T and {s't}i<t<T- It is easy to see that ^ J2j=i ^'t = 
2(j)\. Therefore, 

E[<Pa{C U C')] < (1 - qT+i)2,p*A + qr+iMC)- 

This shows that in each iteration of fc-meansll , for each 
optimal cluster A, we remove a fraction of (J)a and replace 
it with a constant factor times 0^. Thus, Steps 1-6 of 
fc-meansll obtain a constant factor approximation to fc-means 
after O(log-0) rounds and return 0{£\ogtl}) centers. The al- 
gorithm obtains a solution of size fc by clustering the chosen 
centers using a known algorithm. Section 6 contains the for- 
mal arguments that work for the general case when pt 's are 
not necessarily the same. 

3.5 A parallel implementation 

In this section we discuss a parallel implementation of 
fc-meansll in the MapReduce model of computation. We 
assume familiarity with the MapReduce model and refer the 
reader to [13] for further details. As we mentioned earlier, 
Lloyd's iterations can be easily parallelized in MapReduce 
and hence, we only focus on Steps 1-7 in Algorithm 2. Step 
4 is very simple in MapReduce: each mapper can sample 
independently and Step 7 is equally simple given a set C of 
centers. Given a (small) set C of centers, computing 4ix{C) 
is also easy: each mapper working on an input partition 
X' C X can compute 4>x' (C) and the reducer can simply 
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add these values from all mappers to obtain (j)x{C). This 
takes care of Step 2 and the update to <j)x (C) needed for the 
iteration in Steps 3-6. 

Note that we have tacitly assumed that the set C of cen- 
ters is small enough to be held in memory or be distributed 
among all the mappers. While this suffices for nearly all 
practical settings, it is possible to implement the above steps 
in MapReduce even without this assumption. Each map- 
per holding X' (Z X and C' C C can output the tuple 
(x; argmiUcgc' d{x,c)), where x £ X' is the key. From this, 
the reducer can easily compute d{x,C) and hence (j)x{C). Re- 
ducing the amount of intermediate output by the mappers 
in this case is an interesting research direction. 

4. EXPERIMENTAL SETUP 

In this section we present the experimental setup for eval- 
uating fc-meansll . The sequential version of algorithms were 
evaluated on a single workstation with quad-core 2.5GHz 
processors and 16Gb of memory. The parallel algorithms 
were run using a Hadoop cluster of 1968 nodes, each with 
two quad-core 2.5GHz processors and 16GB of memory. 

We describe the datasets and baseline algorithms that will 
be used for comparison. 

4.1 Datasets 

Wc use three datasets to evaluate the performance of 
fc-meansll . The first dataset, GaussMixture, is synthetic; 
a similar version was used in [5]. To generate the dataset, 
we sampled fc centers from a 15-dimensional spherical Gaus- 
sian distribution with mean at the origin and variance R £ 
{1, 10, 100}. We then added points from Gaussian distribu- 
tions of unit variance around each center. Given the k cen- 
ters, this is a mixture of fc spherical Gaussians with equal 
weights. Note that the Gaussians are separated in terms 
of probability mass — even if only marginally for the case 
R — 1 — and therefore the value of the optimal fe-clustering 
can be well approximated using the centers of these Gaus- 
sians. The number of sampled points from this mixture of 
Gaussians is n = 10, 000. 

The other two datasets considered are from real- world set- 
tings and are publicly available from the UC Irvine Machine 
Learning repository (archive. ics.uci.edu/ml/datasets. 
html). The Spam dataset consists of 4601 points in 58 di- 
mensions and represents features available to an e-mail spam 
detection system. The KDDCup1999 dataset consists of 
4.8M points in 42 dimensions and was used for the 1999 
KDD Cup. We also used a 10% sample of this dataset to 
illustrate the effect of different parameter settings. 

For GaussMixture and Spam, given the moderate num- 
ber of points in those datasets, we use fc £ {20,50,100}. 
For KDDCup1999, we experiment with finer clusterings, 
i.e., we use k e {500, 1000}. The datasets GaussMixture 
and Spam are studied with the sequential implementation of 
fc-meansll , whereas we use the parallel implementation (in 
the Hadoop framework) for KDDCup1999. 

4.2 Baselines 

For the rest of the paper, we assume that each initializa- 
tion method is implicitly followed by Lloyd's iterations. We 
compare the performance of fc-meansll initialization against 
the following baselines: 



• fc-means++ initialization, as in Algorithm 1; 

• Random, which selects k points uniformly at random 
from the dataset; and 

• Partition, which is a recent streaming algorithm for 
fc-means clustering [2], described in Section 4.2.1. 

Of these, fc-means++ can be viewed as the true baseline, 
since fc-meansll is a natural parallelization of it. However, 
fc-means++ can be only run on datasets of moderate size 
and only for modest values of k. For large-scale datasets, 
it becomes infeasible and parallelization becomes necessary. 
Since Random is commonly used and is easily parallelized, we 
chose it as one of our baselines. Finally, Partition is a re- 
cent one-pass streaming algorithm with performance guar- 
antees, and is also parallelizable; hence, we included it as 
well in our baseline and describe it in Section 4.2.1. We now 
describe the parameter settings for these algorithms. 

For Random, the parallel MapReduce/Hadoop implemen- 
tation is standard^. In the sequential setting we ran Random 
until convergence, while in the parallel version, we bounded 
the number of iterations to 20. In general, we observed 
that the improvement in the cost of the clustering becomes 
marginal after only a few iterations. Furthermore, taking 
the best of Random repeated multiple times with different 
random initial points also obtained only marginal improve- 
ments in the clustering cost. 

We use fc-means++ for reclustering in Step 8 of fc-meansll . 
We tested fc-meansll with £ e {O.lfc, 0.5fc, fc, 2fc, lOfc}, with 
r = 15 rounds for the case £ — O.lfc, and r = 5 rounds 
otherwise (running fc-meansll for five rounds when £ — O.lfc 
leads it to select fewer than fc centers with high probabil- 
ity). Since the expected number of intermediate points con- 
sidered by fc-meansll is r£, these settings of the parameters 
yield a very small intermediate set (of size between 1.5fc and 
40fc). Nonetheless, the quality of the solutions returned by 
fc-meansll is comparable and often better than Partition, 
which makes use of a much larger set and hence is much 
slower. 

4.2.1 Setting parameters for Partition 

The Partition algorithm [2] takes as input a parameter m 
and works as follows; it divides the input into m equal-sized 
groups. In each group, it runs a variant of fc-means++ that 
selects 3 log fc points in each iteration (traditional fc-means++ 
selects only a single point). At the end of this, similar to 
our reclustering step, it runs (vanilla) fc-means++ on the 
weighted set of these 3m log fc clusters to reduce the number 
of centers to fc. 

Choosing m = \/n/k minimizes the amount of mem- 
ory used by the streaming algorithm. A neat feature of 
Partition is that it can be implemented in parallel: in the 
first round, groups are assigned to m different machines that 
can be run in parallel to obtain the intermediate set and in 
the second round, fc-means++ is run on this set seq uenti ally. 
In the parallel implementation, the setting m — \fnjk not 
only optimizes the memory used by each machine but also 
optimizes the total running time of the algorithm (ignor- 
ing setup costs), as the size of the instance per machine in 
the two rounds is equated. (The instance size per machine 
is 0{n/m) = OiVnk) which yields a running time in each 
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R= 1 


7?= 10 


i? = 100 




seed 


final 


seed 


final 


seed 


final 


Random 


— 


14 


— 


201 


— 


23,337 


fc-means++ 


23 


14 


62 


31 


30 


15 


/c-means|| 


21 


14 


36 


28 


23 


15 


fc-meansll 
£^2k,r^5 


17 


14 


27 


25 


16 


15 



Table 1: The median cost (over 11 runs) on Gauss- 
MlXTURE with k = 50, scaled down by 10'*. We show 
both the cost after the initialization step (seed) and 
the final cost after Lloyd's iterations {final). 





fc = 20 


fe = 50 


fc = 100 




seed 


final 


seed 


final 


seed 


final 


Random 


— 


1,528 


— 


1,488 


— 


1,384 


fc-means++ 


460 


233 


110 


68 


40 


24 


fc-meansll 

^ = fc/2,r = 5 


310 


241 


82 


65 


29 


23 


fc-meansll 
^ = 2fe,r = 5 


260 


234 


69 


66 


24 


24 



Table 2: The median cost (over 11 runs) on Spam 
scaled down by 10"". We sho^v both the cost after 
the initialization step {seed) and the final cost after 
Lloyd's iterations {final). 



round of O(fc^'^v'n).) Note that this implies that the run- 
ning time of Partition does not improve when the number 
of available machines surpasses a certain threshold. On the 
other hand, fc-meansll 's running time improves linearly with 
the number of available machines (as discussed in Section 
2, such issues were considered in [14]). Finally, notice that 
using this optimal setting, the expected size of the interme- 
diate set used by Partition is sVnklogk, which is much 
larger than that obtained by fc-meansll . For instance. Ta- 
ble 5 shows that the size of the coreset returned by fc-meansll 
is smaller by three orders of magnitude. 





fc = 500 


fc = 1000 


Random 


6.8 X 10' 


6.4 X 10' 


Partition 


7.3 


1.9 






O.lfc 


5.1 


1.5 






0.5fc 


19 


5.2 


fc-meansll 


£^ 


fc 


7.7 


2.0 






2fc 


5.2 


1.5 






lOfc 


5.8 


1.6 



Table 3: Clustering cost (scaled dow^n by 10^") for 
KDDCUP1999 for r = 5. 



5. EXPERIMENTAL RESULTS 

In this section we describe the experimental results based 
on the setup in Section 4. We present experiments on both 
sequential and parallel implementations of fc-meansll . Recall 
that the main merits of fc-meansll were stated in Theorem 1: 
(i) fc-mecms|| obtains as a solution whose clustering cost is on 
par with fc-means++ and hence is expected to be much bet- 
ter than Random and (ii) fc-meansll runs in a fewer number of 
rounds when compared to fc-means++, which translates into 
a faster running time especially in the parallel implementa- 
tion. The goal of our experiments will be to demonstrate 
these improvements on massive, real-world datasets. 

5.1 Clustering cost 

To evaluate the clustering cost of fc-meansll , we compare it 
against the baseline approaches. Spam and GaussMixture 
are small enough to be evaluated on a single machine, and we 
compare their cost to that of fc-meansll for moderate values 
of fc e {20,50, 100}. We note that for fc > 50, the centers 
selected by Partition before reclustering represent the full 
dataset (as sVnk log k > n for these datasets), which means 
that results of Partition would be identical to those of 
fc-means++. Hence, in this case, we only compare fc-meansll 
with fc-means++ and Random. KDDCuPl999 is sufficiently 
large that for large values of fc G {500, 1000}, fc-means++ is 
extremely slow when run on a single machine. Hence, in 
this case, we will only compare the parallel implementation 
of fc-meansll with Partition and Random. 

We present the results for GaussMixture in Table 1 and 
for Spam in Table 2. For each algorithm we list the cost of 
the solution both at the end of the initialization step, before 
any Lloyd's iteration and the final cost. We present two pa- 
rameter settings for fc-meansll ; we will explore the effect of 
the parameters on the performance of the algorithm in Sec- 
tion 5.3. We note that the initialization cost of fc-meansll is 
typically lower than that of fc-means++. This suggests that 



the centers produced by fc-meansll avoid outliers, i.e., points 
that "confuse" fc-means++. This improvement persists, al- 
though is not as pronounced if we look at the final cost of 
the clustering. In Table 3 we present the results for KD- 
DCup1999. It is clear that both fc-meansll and Partition 
outperform Random by orders of magnitude. The overall cost 
for fc-meansll improves with larger values of £ and surpasses 
that of Partition for l> k. 

5.2 Running time 

We now show that fc-meansll is faster than Random and 
Partition when implemented to run in parallel. Recall that 
the running time of fc-meansll consists of two components: 
the time required to generate the initial solution and the 
running time of Lloyd's iteration to convergence. The former 
is proportional to both the number of passes through the 
data and the size of the intermediate solution. 

We first turn our attention to the running time of the ini- 
tialization routine. It is clear that the number r of rounds 
used by fc-meansll is much smaller than that by fc-means++. 
We therefore focus on the parallel implementation and com- 
pare fc-meansll against Partition and Random. In Table 4 
we show the total running time of these algorithms. For var- 
ious settings of I, fc-meansll runs much faster than Random 
and Partition. 





fc = 500 


fc = 1000 


Random 


300.0 


489.4 


Partition 


420.2 


1,021.7 






O.lfc 


230.2 


222.6 






0.5fc 


69.0 


46.2 


fc-meansll 


£ = 


fc 


75.6 


89.1 






2fc 


69.8 


86.7 






lOfc 


75.7 


101.0 



Table 4: Time (in minutes) for KDDCuPl999. 
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While one can expect fc-meansll to be faster than Random, 
we investigate the reason why fc-meansll runs faster than 
Partition. Recall that both fe-meansll and Partition first 
select a large number of centers and then recluster the cen- 
ters to find the k initial points, fn Table 5 we show the total 
number of intermediate centers chosen both by fe-meansll 
and Partition before reclustering on KDDCup1999. We 
observe that fe-meansll is more judicious in selecting cen- 
ters, and typically selects only 10-40% as many centers as 
Partition, which directly translates into a faster running 
time, without sacrificing the quality of the solution. Select- 
ing fewer points in the intermediate state directly translates 
to the observed speedup. 





fc = 500 


fe = 1000 


Partition 


9.5 X 10^ 


1.47 X 10" 


fc-meansll , £ — 


O.lfc 

0.5fc 

fc 

2fc 

lOfc 


602 

591 

1,074 

2,321 

9,116 


1,240 
1,124 
2,234 
3,604 

7,588 



Table 5: Number of centers for KDDCuPl999 before 
the reclustering. 

We next show an unexpected benefit of fc-meansll : initial 
solution found by fc-meansll leads to a faster convergence of 
the Lloyd's iteration. In Table 6 we show the number of 
iterations to convergence of Lloyd's iterations for different 
initializations. We observe that fe-meansll typically requires 
fewer iterations than fc-means++ to converge to a local op- 
timum, and both converge significantly faster than Random. 





fe = 20 


fc = 50 


k = 100 


Random 


176.4 


166.8 


60.4 


fc-means++ 


38.3 


42.2 


36.6 


fe-meansll 

£ = 0.5fc,r = 5 


36.9 


30.8 


30.2 


fe-meEtnsJI 

£ = 2fc,r = 5 


23.3 


28.1 


29.7 



show the result on a 10% sample of KDDCuPl999, with 
varying values of fe 

When £ = k and the algorithm selects exactly fe points, 
we can see that the final clustering cost (after completing 
the Lloyd's iteration) is monotonically decreasing with the 
number of rounds. Moreover, even a handful of rounds is 
enough to substantially bring down the final cost. Increasing 
£ to 2fc and 4fc, while keeping the total number of rounds 
fixed leads to an improved solution, however this benefit 
becomes less pronounced as the number of rounds increases. 
Experimentally we find that the sweet spot lies when r « 8, 
and oversampling is beneficial for r < 8. 

In the next set of experiments, we explore the choice of 
£ and r when the sampling is done with replacement, as in 
specifications of fc-meansll . Recall we can guarantee that the 
number of rounds needs to be at most 0{\ogtl}) to achieve 
a constant competitive solution. However, in practice a 
smaller number of rounds suffices. (Note that we need at 
least k/£ rounds, otherwise we run the risk of having fewer 
than fe centers in the initial set.) 

In Figure 5.2 and Figure 5.3, we plot the cost of the final 
solution as a function of the number of rounds used to ini- 
tialize fe-meansll on GaussMixture and Spam respectively. 
We also plot the final potential achieved by fc-means++ as 
point of comparison. Observe that when r ■ £ < k, the so- 
lution is substantially worse than that of fe-means++. This 
is not surprising since in expectation fe-meansll has selected 
too few points. However as soon as r ■ £ > k, the algorithm 
finds as good of an initial set as that found by fc-means++. 

6. ANALYSIS 

In this section we present the full analysis of our algo- 
rithm, which shows that in each round of the algorithm 
there is a significant drop in cost of the current solution. 
Specifically, we show that the cost of the solution drops by 
a constant factor plus 0{4>*) in each round — this is the key 
technical step in our analysis. The formal statement is the 
following. 



Theorem 2. Let a 



Table 6: Number of Lloyd's iterations till conver- 
gence (averaged over 10 runs) for Spam. 



exp(^-(l-e-^/(2fe))^ ^e-A. // 

C is the set of centers at the beginning of an iteration of 
Algorithm 2 and C' is the random set of centers added in 
that iteration, then 



5.3 Trading-off quality with running time 

By changing the number r of rounds, fc-meansll interpo- 
lates between a purely random initialization of fc-means and 
the biased sequential initialization of fc-means++. When 
r = all of the points are sampled uniformly at random, 
simulating the Random initialization, and when r = fc, the 
algorithm updates the probability distribution at every step, 
simulating fc-means++. In this section we explore this trade- 
off. There is an additional technicality that we must be 
cognizant of: whereas fe-means++ draws a single center from 
the joint distribution induced by D^ weighting, fc-meansll se- 
lects each point independently with probability proportional 
to D^, selecting £ points in expectation. 

We first investigate the effect of r and I on clustering 
quality. In order to reduce the variance in the computations, 
and to make sure have exactly £ ■ r points at the end of the 
point selection step, we begin by sampling exactly £ points 
from the joint distribution in every round. In Figure 5.1 we 



E[<t)xiCuC')]< 



+ 



-4>x{C). 



Before we proceed to prove Theorem 2, we consider its fol- 
lowing simple corollary. 

Corollary 3. // (jr^' is the cost of the clustering after 
the ith round of Algorithm 2, then 



E\ 



aWi 



< 



v>- 



16 



Proof. By an induction on i. The base case i = is 
trivial, as ^^'^' = tp. Assume the claim is valid up to i. 
Then, we will prove it for i -I- 1. From Theorem 2, we know 
that 



EV 



80*. 



By taking an expectation over (j)^ '' , we have 



El 



)]<^S[0«]+8r 



628 



KDD Dataset, k=1 7 



KDD Dataset, k=33 



1e+16 

1e+15 

S 1e+14 

o 

1e+13 

1e+12 



1e+16 
1e+15 
1e+14 
1e+13 
1e+12 
1e+11 



l/k=1 


1 


; 


; i/k=2 




. 


r\ l/k=4 




- 


1 


10 




log # Rounds 






KDD Dataset, k=65 






l/k=1 




■ 


\ l/k=2 




■ 


;,X l/k=4 




■ 


^ '"\\ 




-. 


'r ''\\ 




-. 


■.--r-n-T-.^n-H --^- 


=i~^ 


_ 




1e+16 
1e+15 
1e+14 
1e+13 
1e+12 
1e+11 
1e+10 



10 
log # Rounds 

KDD Dataset, k=129 



1 

l/k= 

r l/k= 


1 

=1 ■ 

=2 . 


N<, l/k= 


=4 


r\ 


~ 


^ '\\ 


-■ 


- ''■$\ 


-■ 


'- '••■^\. 


1 


1 



10 
log # Rounds 



10 
log # Rounds 



100 



Figure 5.1: The effect of different values of £ and the number of rounds r on the final cost of the algorithm 
for a 10% sample of KDDCup1999. Each data point is the median of 11 runs of the algorithm. 



By the induction hypothesis on E[(f)^^'], we have 

i.+i 



E[(j}'-'+'>] < 



1 + a 



1 + Q 



^ + 8(i±^ + l 



1-a 



i+i 



i>- 



16 

1-Q 



0*. D 



Corollary 3 implies that after O(logV') rounds, the cost of 
the clustering is 0((^*); Theorem 1 is then an immediate 
consequence. We now proceed to establish Theorem 2. 

Consider any cluster A with centroid(^) in the optimal 
solution. Denote \A\ — T and let ai, . . . ,aT be the points 
in A sorted increasingly with respect to their distance to 
centroid(yl). Let C" denote the set of centers that are se- 
lected during a particular iteration. For 1 < i < T, we 
let 

qt = PrK G C, a, ^ C", V 1 < j < i] 

be the probability that the first t—1 points {ai, . . . , at-i} are 
not sampled during this iteration and at is sampled. Also, 
we denote by qr+i the probability that no point is sampled 
from cluster A. 

Furthermore, for the remainder of this section, let D{a) — 
d{a,C), where C is the set of centers in the current iteration. 
Letting pt denote the probability of selecting at, we have. 



by definition of the algorithm, pt = — 4^ . Since fc-meansll 
picks each point independently, using the convention that 
Pt+1 = 1, we have for all 1 < i < T + 1, 



Qt ^PtYl{l-Pj)- 



The main idea behind the proof is to consider only those 
clusters in the optimal solution that have significant cost rel- 
ative to the total clustering cost. For each of these clusters, 
the idea is to first express both its clustering cost and the 
probability that an early point is not selected as linear func- 
tions of the qt's (Lemmas 4, 5), and then appeal to linear 
programming (LP) duality in order to bound the clustering 
cost itself (Lemma 6 and Corollary 7). To formalize this 
idea, we start by defining 



St 



I aeA 



■ at 



for all 1 < t < T, and st+i = 4>a- Then, letting 0^ = 
(J>a{CUC') be the clustering cost of cluster A after the current 
round of the algorithm, we have the following. 
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Figure 5.2: The cost of fc-meansll followed by Lloyd's iterations as a function of the number of initialization 
rounds for GaussMixture. 



Lemma 4. The expected cost of clustering an optimum 
cluster A after a round of Algorithm 2 is bounded as 



Thus 



T+l 
E[<t>'A] < Y. 1^"^ 



(6.1) 



Proof. We can rewrite the expectation of the clustering 
cost <j)'ji^ for cluster A after one round of the algorithm as 
follows: 

T 

E[^'a] =Y,qtE[(k'A I at e C",a, ^ C',V1 <i < t]. (6.2) 
t=i 

Observe that conditioned on the fact that at G C , we can 
either assign all the points in A to center at, or just stick 
with the former clustering of A, whichever has a smaller 
cost. Hence 

E[<P'a I at €C',a, ^ C',V1 < j < t] < st. 

and the result follows from (6.2). D 

In order to minimize the right hand side of (6.1), we want 
to be sure that the sampling done by the algorithm places 
a lot of weight on qt for small values of t. Intuitively this 
means that we are more likely to select a point close to the 
optimal center of the cluster than one further away. Our 
sampling based on -D^(-) implies a constraint on the prob- 
ability that an early point is not selected, which we detail 
below. 

Lemma 5. Let rjo = 1, and, for any 1 < t < T, rjt — 



n-.i( 



4>A 



(1 - qT+i)] ■ Then, for anyO<t<T, 

T+l 

Y qr < rit. 

r = t + l 



Proof. First note that qr+i = Ylt=ii^ — Pt) > 1 — 
J2t^iPt- Therefore, 



1 - qr+i <y^^Pt = £ 



(pA 



W'jat) ^ D\at) ,, . 

Pt = -^ > , (1 - qr+i) 



To prove the lemma, by the definition of g^ we have 

T+l / t \ T+l 7 — 1 

E <ir=iu(^-p^))- E n (i-p^)p^ 



r = t + l \j = l 



r = t + lj = t+l 



<U(^-Pj) 

= Vt- □ 

Having proved this lemma, we now slightly change our 
perspective and think of the values qt {1 < t < T + 1) as 
variables that (by Lemma 5) satisfy a number of linear con- 
straints and also (by Lemma 4) a linear function of which 
bounds _B[(/>^]. This naturally leads to an LP on these vari- 
ables to get an upper bound on _E[(^^]; see Figure 6.1. We 
will then use the properties of the LP and its dual to prove 
the following lemma. 

Lemma 6. The expected potential of an optimal cluster A 
after a sampling step m Algorithm 2 is bounded as 



E[<f>'A] < (l - qT+i)Y, 



D'iat] 



St + rjT(l>A- 



Proof. Since the points in A are sorted increasingly with 
respect to their distances to the centroid, letting 



E 



\a — at\ 



(6.3) 



< s'rp. Hence, since 



for 1 < t < T, we have that si < 

St = rmn{(f>A, s't}, we also have si < • • • < st < st+i. 

Now consider the LP in Figure 6.1 and its dual. Since st 
is an increasing sequence, the optimal solution to the dual 
must have at = st+i — st (letting so = 0). Then, we can 
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Figure 5.3: The cost of fc-meansll followed by Lloyd's iterations as a function of the number of initialization 
rounds for Spam. 



T+l 

max y^ qtSt 

<31v,9T+l f— J* 

subject to 

T+l 

Y, 1r< Vt, (VO < i < T) 

<3t > 0. (VI < t < T + 1) 



mm 

aQ....,aj- 



t=0 

subject to 

t-i 

^a, >st, (Vl<t<T + l) 

r=0 

at > 0. (VO < t < T) 
Figure 6.1: LP (left) and its dual (right). 



bound the value of the dual (and hence the value of the 
primal, which by Lemma 4 and Lemma 5 is an upper bound 
on E[(I>'jh]) as follows: 



E[(I>'a] < ^VtOit 
t=o 

T 
= I] St (77, 



t=l 

T 

E 



stVt-i 



•t-i — r/t) + rjTST+i 
D^{at 



(1 — QT+l) +r]TST+l 



<il-qT+l)J2 



D\at) 



St + rjT(l>A, 



where the last step follows since rjt < 1. D 
This results in the following corollary; 

Corollary 7. 

EWa] < 80i(l - qr+i) + (t>Ae-^^-'^+'\ 

Proof. By the triangle inequality, for all a, at we have 
D{at) < D{a) + \\a — at\\. The power-mean inequality then 



implies that D^{at) < 2D^{a) + 2jja — at|p. Summing over 
all a G A and dividing by (pA, we have that 



A A ~ T T (J>a' 



where s't is defined in (6.3). Hence, 



t=i ^ t=i ^ 

2 V^ , 2 >-^ s'tSt 

T T 

< ^ E ^* + r E ^^ 



where in the last inequality, we used st < s't for the first 
summation, and St < 4>a for the second one. 
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inally, noticing 




77t = 1 1 

i=A 


(t>A 


< exp - 
V 





(l-5T+l)j 

(1 - qr+i) 

= exp(-(l-(7T+i)), 
the proof follows from Lemma 6. D 

We are now ready to prove the main result. 

Proof of Theorem 2. Let Ai,. . . ,Akhe the clusters in 
the optimal solution C^- . We partition these clusters into 
"heavy" and "light" as follows: 



c«^M.c,Hf >4 



and 



li4>A 



Cl ~ C^* \ Ch- 
Recall that 

«?T+i = ]^(1 - Pj) < exp I - ^pj J = cxp 

Then, by Corollary 7, for any heavy cluster A, we have 

EWa] < 8<P*a(1 - QT+i) + <^Ae-<^-'^+i) 

<8<^i + exp(-(l-e-^/2'=))0A 
— 8<j}A + a<j)A- 
Summing up over all A G Ch, we get 

Then, by noting that 

J, ^ 't' \r \ ^ 't' I. "^ 
^ "- - 2k ^ ^ - 2k 2' 

and that E[<j}'(;^] < 4>Cl, we have 

E[(f>'] < 8(t>*c^ + a4>c^ + <Pcr, 

< 8<^S„ + (a + (1 - a)/2)0 
<80* + (a + (l-a)/2)0. D 



7. CONCLUSIONS 

In this paper we obtained an efficient parallel version 
fc-meansll of the inherently sequential fe-means++. The algo- 
rithm is simple and embarrassingly parallel and hence ad- 
mits easy realization in any parallel computational model. 
Using a non-trivial analysis, we also show that fc-meansll 
achieves a constant factor approximation to the optimum. 
Experimental results on large real-world datasets (on which 
many existing algorithms for fc-means can grind for a long 
time) demonstrate the scalability of fc-meaais|| . 

There have been several modifications to the basic fc-means 
algorithm to suit specific applications. It will be interest- 
ing to see if such modifications can also be efficiently paral- 
lelized. 
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